notebook.community

Edit and run



In [1]:

    
import pandas as pd
import numpy as np
import networkx as nx

There's a whole bunch of hubway data available over at the Hubway Data Challenge website.

There's already been a bunch of visualizations submitted, especially since the submission date was way back in 2012 (I think).

I decided I'd try my hand, especially since sigma.js seems like such a useful visualization tool.

I also wanted to see if a community detection algorithm on the graph would discover anything interesting (or otherwise obvious) about the communiting patterns of hubway users.

The pipeline looks like this:

Read dataset from the hubway site into pandas
Create a networkx directed graph representation
Run community detection
Export it to .gexf
Play around with the graph until something nice appears
Export THAT graph to .gexf, then play around with sigma.js until something else nice appears.

Okay, here we go:

First, import all the station and trip data from hubway.

Here's what they look like after pulling them into pandas:



In [2]:

    
stations = pd.read_csv('../data/stations_10_12_to_11_13.csv', 
                       index_col=0)

trips = pd.read_csv('../data/hubwaydata_10_12_to_11_13.csv',
                    index_col=0,
                    parse_dates=['start_date', 'end_date'])









    



/Users/dan/attic/tools/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py:1070: DtypeWarning: Columns (9) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)



In [125]:

    
# sigma.js requres the node IDs to be strings.
stations = stations.set_index(stations.index.map(str))
trips.start_station = trips.start_station.map(str)
trips.end_station = trips.end_station.map(str)



In [128]:

    
stations.head(2)









    Out[128]:






  
    
      
      terminal
      station
      municipality
      lat
      lng
      nb_docks
      install_date
      last_day
    
  
  
    
      3
       B32006
            Colleges of the Fenway
       Boston
       42.340021
      -71.100812
       15
       4/8/2013
       11/27/2013
    
    
      4
       C32000
       Tremont St. at Berkeley St.
       Boston
       42.345392
      -71.069616
       15
       4/8/2013
       11/27/2013
    
  

2 rows × 8 columns



In [127]:

    
trips.head(2)









    Out[127]:






  
    
      
      status
      duration
      start_date
      start_station
      end_date
      end_station
      bike_nr
      subscription_type
      zip_code
      birth_date
      gender
    
    
      id
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      708191
       Closed
       240
      2012-10-31 23:57:00
       53
      2012-11-01 00:01:00
        67
       B00412
       Registered
       02139
      NaN
         Male
    
    
      708190
       Closed
       840
      2012-10-31 23:54:00
       36
      2012-11-01 00:08:00
       115
       B00491
       Registered
       02127
      NaN
       Female
    
  

2 rows × 11 columns

Create the graph

Next, we've gotta create the graph.

To do this, I iterate through all the rows of the stations dataframe, adding nodes to the graph as I go.

I weight an edge between two stations $n_1$ and $n_2$ by the number of trips taken from $n_1$ to $n_2$, $t_{1,2}$.

(now that I'm writing this, I should probably normalize these weights by the total number of trips leaving a station in general. I'll update the notebook later.)



In [129]:

    
DG = nx.DiGraph()

# add station nodes
# sigma.js requires a string ID, an X, and a Y for each node.
for n, r in stations.iterrows():
    DG.add_node(n,
                label=r.station,
                y=r.lat,
                x=r.lng)
    
# add edges for trips.
# There's an edge between two stations if there was a trip between them.
# Edges are weighted by the number of trips
for n, r in trips.iterrows():
    if DG.has_edge(r.start_station, r.end_station):
        DG[r.start_station][r.end_station]['weight'] += 1
    else:
        DG.add_edge(r.start_station, r.end_station, weight=1)

Here's what the gross networkx visualization looks like (with the Fruchterman-Reingold layout):



In [140]:

    
pos = nx.fruchterman_reingold_layout(DG)
nx.draw_networkx(DG, pos)

Community detection

There's a nice little python module called communities that implements the Louvain community detection method.

I'm pretty ignorant to best practices here, so I just ran it a few times until I saw a layout in the induced community graph (image #2).



In [131]:

    
import community

DG = nx.Graph(DG)
partition = community.best_partition(DG)

size = float(len(set(partition.values())))
pos = nx.spring_layout(DG)
count = 0
for com in set(partition.values()) :
    count = count + 1.
    list_nodes = [nodes for nodes in partition.keys()
                                if partition[nodes] == com]
    nx.draw_networkx_nodes(DG, pos, list_nodes, node_size = 20,
                                node_color = str(count / size))


nx.draw_networkx_edges(DG,pos, alpha=0.8)
plt.show()


# let's see if it makes sense:
ind_graph = community.induced_graph(partition, DG)

#pos = nx.spring_layout(ind_graph)
pos = nx.fruchterman_reingold_layout(ind_graph)

nx.draw_networkx(ind_graph, pos)

Great. Now that we've run the community detection, we can then add a 'community' attribute to each node, and then export it to .gexf for use in gephi.



In [132]:

    
# add the communities to the graph
for n, d in DG.nodes_iter(data=True):
    DG.node[n]['community'] = partition[n]



In [ ]:

    
nx.write_gexf(DG, '../sigma/trips.json')

Messing around with the network is beyond the scope of this post/notebook.

In Gephi, I did the following things:

Color-coded the communities
removed edges with a weight less than 50, representing routes that had fewer than fifty total trips in the dataset.
applied a geo-layout to keep the geographic data of the stations intact.

You can see the finished product produced with sigma.js here

	terminal	station	municipality	lat	lng	nb_docks	install_date	last_day
3	B32006	Colleges of the Fenway	Boston	42.340021	-71.100812	15	4/8/2013	11/27/2013
4	C32000	Tremont St. at Berkeley St.	Boston	42.345392	-71.069616	15	4/8/2013	11/27/2013

	status	duration	start_date	start_station	end_date	end_station	bike_nr	subscription_type	zip_code	birth_date	gender
id
708191	Closed	240	2012-10-31 23:57:00	53	2012-11-01 00:01:00	67	B00412	Registered	02139	NaN	Male
708190	Closed	840	2012-10-31 23:54:00	36	2012-11-01 00:08:00	115	B00491	Registered	02127	NaN	Female